Archiving the Internet:
Towards a Core Internet Service


Other Repositories

> Library of Alexandria: 800GB (400k scrolls @ 2MB)

> Library of Congress: 20TB (20M books, ASCII)

> Dialog Information Service: 3-5TB

> Video Store: 5TB (5k videos, 1GB/hr)

> Public Branch Library: 3TB (300k scanned books)

> Radio Station: 1TB (15k hrs of music)

> . . . Internet Archive: 1-10TB


Brewster Kahle
President, Internet Archive
brewster@archive.org
What is it?

Who will care?

Is it possible?

Why do we want to do it?




Our Mission is to

> Gather, Archive, and Serve all public Internet
information (WWW, Netnews, Gopher, and
Usage Logs)


Offering for the first time . . .

> Reliability (Backing store for Net resources)

> Accountability (Official copy of record)

> Durability (Library for Internet research
community)

- Demographics, clustering, indexing
Who Will Care?

> Users: reliable access to the Net resources
(built into browsers and proxies)

> Scholars/Historians: understanding the new
medium

> Marketers: demographic treasure-trove

> Entrepreneurs: basis for new value-added
services


Is it Doable?

> Legal/Social Issues:

- Privacy

- Copyright/licensing

- Export/Pornography

> Technical:

- Gathering

- Storage

- Access


Gathering

> Methods: Crawling, Tape donations,
Satellite receiver

> Technology: Tuned machines, mostly
custom software


> Speed: T3 (45Mb/s) = 500 GB/day, 66¢/GB
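
The 500 GB/day figure can be checked from the line rate alone. The sketch below assumes decimal units (1 GB = 10^9 bytes) and a fully saturated link with no protocol overhead; the function name is illustrative and not from the slides. Under those assumptions a 45 Mb/s T3 moves 486 GB/day, which the slide rounds to 500.

```python
def link_gb_per_day(megabits_per_second: float) -> float:
    """Raw daily transfer volume of a saturated link, in decimal gigabytes."""
    bytes_per_second = megabits_per_second * 1_000_000 / 8  # bits -> bytes
    seconds_per_day = 24 * 60 * 60
    return bytes_per_second * seconds_per_day / 1_000_000_000


t3_gb_per_day = link_gb_per_day(45)
print(f"T3 at 45 Mb/s: about {t3_gb_per_day:.0f} GB/day")
```

Real crawls would see less than this, since the estimate ignores overhead and idle time.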


Storage

> Mitra's Law of Archiving: For every dollar
they spend, we can only spend a nickel

> Disks: $200/GB, RAID: $500/GB

> Luckily, tape costs recently plummeted
Storage: DLT 7700

> # tapes:
> # drives:
> Storage:
> Cost:
> Cost/GB:
> Speed:

* Compressed (for native, divide by 2)


Storage: ATL Odetics 452

> # tapes: 52
> # drives: 4
> Storage: 3.6TB*
> Cost: $54k
> Cost/GB:
> Speed: 40MB/s

* Compressed (for native, divide by 2)
Storage: ATL Odetics 2640

> # tapes: 264
> # drives: 3
> Storage: 18TB*
> Cost: $100k
> Cost/GB: $6/GB*
> Speed: 30MB/s*

* Compressed (for native, divide by 2)
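
The cost-per-gigabyte figures for the tape robots follow directly from the listed price and capacity. The sketch below is an illustration only, assuming 1 TB = 1000 GB and using the compressed capacities and prices shown on the two ATL Odetics slides; the class and method names are invented for the example. It gives $15/GB for the 452 and about $5.56/GB for the 2640, consistent with the $6/GB shown for the larger robot.

```python
from dataclasses import dataclass


@dataclass
class TapeLibrary:
    name: str
    compressed_tb: float  # capacity with compression, as quoted on the slide
    cost_usd: float

    def native_tb(self) -> float:
        # The slides' footnote: for native capacity, divide by 2.
        return self.compressed_tb / 2

    def cost_per_gb(self) -> float:
        # Dollars per compressed gigabyte, assuming 1 TB = 1000 GB.
        return self.cost_usd / (self.compressed_tb * 1000)


libraries = [
    TapeLibrary("ATL Odetics 452", 3.6, 54_000),
    TapeLibrary("ATL Odetics 2640", 18.0, 100_000),
]

for lib in libraries:
    print(f"{lib.name}: ${lib.cost_per_gb():.2f}/GB compressed, "
          f"{lib.native_tb():.1f} TB native")
```

Compared with the disk price of $200/GB quoted earlier, either robot is more than an order of magnitude cheaper per gigabyte.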


Therefore, We have the Technology

> Gathering 10TB takes 20 days, $10k raw
bandwidth, custom software (+ CPUs)

> Storing 10TB takes $100k for robot,
ingenuity (+ fast I/O)

> Public Access takes calculations and fresh
ideas
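
The 20-day estimate is the 10TB target divided by the 500 GB/day gathering rate quoted earlier. A rough sketch of that budget follows, assuming 1 TB = 1000 GB and counting only the two dollar figures the slide gives; software, CPUs, I/O, and public access are left out because the slides put no price on them. All names are illustrative.

```python
import math

GATHER_BANDWIDTH_USD = 10_000   # raw bandwidth, from the slide
STORAGE_ROBOT_USD = 100_000     # tape robot, from the slide


def days_to_gather(target_tb: float, gb_per_day: float = 500.0) -> int:
    """Whole days needed to pull target_tb at a steady gb_per_day."""
    return math.ceil(target_tb * 1000 / gb_per_day)


days = days_to_gather(10)
hardware_and_bandwidth_usd = GATHER_BANDWIDTH_USD + STORAGE_ROBOT_USD

print(f"Gathering 10 TB: {days} days")
print(f"Bandwidth plus robot: ${hardware_and_bandwidth_usd:,}")
```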
Where does the Technology Lead?

> Intranet applications

> Video Storage/Servers

> Data mining

> International Internet Centers

> Towards an "Internet Operating System"
of backup, cache consistency, accounting,
directory, file storage . . .


Impact of the Archive

> Transition the Net from Ephemera to an
Enduring Medium

> Inject extra computing services for navigation,
reliability, coordination

> Build lasting position IN the Net (not ON the
Net)
Building a Library that can Think.

What does it take?

> Bandwidth

> Computes

> Smarts

> Gumption


